Authorship identification from unstructured texts
Abstract
Authorship identification is the task of identifying the authors of anonymous texts, given examples of each candidate author's writing. The growing volume of anonymous text on the Internet makes authorship identification increasingly urgent. It has been applied in a growing number of practical areas, including literary works, intelligence, criminal law, civil law, and computer forensics. In this paper, we propose a semantic association model of voice, word dependency relations, and non-subject stylistic words to represent the writing style of unstructured texts by various authors; design an unsupervised approach to extract stylistic features; and employ principal component analysis and linear discriminant analysis to identify the authorship of texts. The paper provides a uniform, quantified method for capturing syntactic and semantic stylistic characteristics of and between words and phrases, and the approach can, to some extent, solve the problem of the independence of different dimensions. Experimental results on two English text corpora show that our approach significantly improves the overall performance of authorship identification. © 2014 Elsevier B.V. All rights reserved.
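The classification stage named in the abstract, principal component analysis followed by linear discriminant analysis, can be illustrated with a minimal pure-Python sketch for two authors. Everything specific here is an assumption made for illustration: the toy feature vectors (per-1000-word rates of four unnamed stylistic markers), the function names, the power-iteration PCA, and the two-class Fisher discriminant. The paper's semantic association features and its actual implementation are not reproduced.

```python
import math

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_pca(rows, k, iters=200):
    """Top-k principal directions via power iteration with deflation."""
    mu = mean_vec(rows)
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    d = len(mu)
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (len(rows) - 1)
            for j in range(d)] for i in range(d)]
    comps = []
    for c in range(k):
        v = [1.0 / (i + 1 + c) for i in range(d)]  # fixed start vector
        for _ in range(iters):
            w = [dot(row, v) for row in cov]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in cov])  # eigenvalue estimate
        comps.append(v)
        # Deflate: remove the direction just found from the covariance.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mu, comps

def project(row, mu, comps):
    """Coordinates of one document in the principal-component space."""
    centered = [x - m for x, m in zip(row, mu)]
    return [dot(centered, c) for c in comps]

def fit_lda_2d(points, labels, ridge=1e-9):
    """Two-class Fisher discriminant in 2-D: w = Sw^-1 (m1 - m0)."""
    groups = {0: [], 1: []}
    for p, y in zip(points, labels):
        groups[y].append(p)
    m0, m1 = mean_vec(groups[0]), mean_vec(groups[1])
    # Within-class scatter matrix [[a, b], [b, c]], with a tiny ridge
    # on the diagonal (an assumption, added to guard against singularity).
    a = c = ridge
    b = 0.0
    for y, m in ((0, m0), (1, m1)):
        for p in groups[y]:
            dx, dy = p[0] - m[0], p[1] - m[1]
            a += dx * dx
            b += dx * dy
            c += dy * dy
    det = a * c - b * b
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [(c * diff[0] - b * diff[1]) / det,
         (-b * diff[0] + a * diff[1]) / det]
    threshold = (dot(w, m0) + dot(w, m1)) / 2  # midpoint of projected means
    return w, threshold

def predict(row, mu, comps, w, threshold):
    """Return 1 if the document scores on author 1's side, else 0."""
    return 1 if dot(w, project(row, mu, comps)) > threshold else 0

# Hypothetical feature vectors: four documents per author.
DOCS = [[50, 20, 10, 30], [52, 18, 12, 28], [48, 22, 11, 33], [51, 19, 9, 31],
        [30, 40, 20, 15], [32, 38, 22, 17], [28, 42, 19, 13], [31, 41, 21, 18]]
AUTHORS = [0, 0, 0, 0, 1, 1, 1, 1]

mu, comps = fit_pca(DOCS, k=2)
w, threshold = fit_lda_2d([project(d, mu, comps) for d in DOCS], AUTHORS)
print(predict([49, 21, 10, 29], mu, comps, w, threshold))
```

PCA comes first so that the discriminant is fitted in a low-dimensional space where the within-class scatter matrix is well conditioned. With more than two authors, a multi-class discriminant would be needed in place of the single threshold used here.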
Related resources
Text Classification For Authorship Attribution Analysis
Authorship attribution mainly deals with undecided authorship of literary texts. Authorship attribution is useful in resolving issues such as uncertain authorship, recognizing the authorship of unknown texts, spotting plagiarism, and so on. Statistical methods can be used to set apart the approach of an author numerically. The basic methodologies used in computational stylometry are word length, sen...
Authorship Identification of Romanian Texts with Controversial Paternity
In this work we propose a new strategy for the authorship identification problem and we test it on an example from Romanian literature: did Radu Albala find the continuation of Mateiu Caragiale's novel "Sub pecetea tainei", or did he write the continuation himself? The proposed strategy is based on the similarity of rankings of function words; we compare the obtained results with th...
Authorship Identification for Heterogeneous Documents
The study of authorship identification in Japanese has for the most part been restricted to literary texts using basic statistical methods. In the present study, authors of mailing list messages are identified using a machine learning technique (Support Vector Machines). In addition, the classifier trained on the mailing list data is applied to identify the author of Web documents in order to i...
Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks
Automatic identification of authorship in disputed documents has benefited from complex network theory as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network growth mechanisms, but only a few studies have probed the suitability of networks in modeling small c...
Kernel Methods and String Kernels for Authorship Analysis
This paper presents our approach to the PAN 2012 Traditional Authorship Attribution tasks and the Sexual Predator Identification task. We approached these tasks with machine learning methods that work at the character level. More precisely, we treated texts as just sequences of symbols (strings) and used string kernels in conjunction with different kernel-based learning methods: supervised and ...
Journal title:
- Knowl.-Based Syst.
Volume: 66
Publication date: 2014